Relatively easy to
count deaths in the general population (mortality)
count how many people have a disease (prevalence)
Less easy to measure (e.g. cohort studies more expensive than surveys)
risk of getting a disease [over a period of time] (incidence)
risk of death for a person with a given disease (case fatality)
how all these quantities vary in population subgroups
Multistate lifetable models and typical sources of data
Methods for inferring e.g. incidence, remission, case fatality
when these not observed directly, but we have observed prevalence, mortality
Statistical inference, uncertainty quantification, Bayesian methods…
The disbayes R software - practical exercise
Key assumption: the probability of death from other causes is independent of disease status
Include all diseases relevant to health impact question
Each disease-specific model determined independently given cause-specific mortality data
Models combined assuming independence (ref Belen’s bit)
Consider diseases independently of each other and aim to obtain this model
Age-specific incidence \(i(a)\), remission rate \(r(a)\) and case fatality \(f(a)\)
Rate: expected number of events per person-time of follow-up
Rates let you simulate data over time – prevalence doesn’t
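A minimal sketch of why rates enable simulation (not from the course materials; numbers are illustrative): a constant incidence rate determines a distribution of event times, which a prevalence proportion alone does not.

```python
import random

def simulate_onset_times(rate, n, horizon, rng):
    """Simulate disease onset times for n initially healthy people
    followed for `horizon` years, assuming a constant incidence rate
    (events per person-year); onsets after the horizon are censored."""
    times = []
    for _ in range(n):
        t = rng.expovariate(rate)  # exponential waiting time to onset
        times.append(t if t <= horizon else None)
    return times

rng = random.Random(42)
onsets = simulate_onset_times(rate=0.02, n=10_000, horizon=5, rng=rng)
# With rate 0.02/year over 5 years, P(onset) = 1 - exp(-0.1), about 9.5%
prop_onset = sum(t is not None for t in onsets) / len(onsets)
```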
Varies with the setting but…
Population mortality data (by cause)
Prevalence of a disease
Incidence and case fatality less common (e.g. cohort studies)
Note the distinction:
mortality: number of deaths from disease / people in the population
case fatality: number of deaths from disease / people with the disease
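A hypothetical worked example of the distinction (numbers invented for illustration):

```python
population = 100_000    # people in the population
with_disease = 500      # people who currently have the disease
deaths = 50             # deaths from the disease in one year

mortality = deaths / population        # 50/100000 = 0.0005
case_fatality = deaths / with_disease  # 50/500    = 0.1
```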
(Institute for Health Metrics and Evaluation, University of Washington, Seattle, USA)
Publishes estimates every few years (most recently 2021) of
incidence, prevalence, cause-specific mortality, etc.
for most countries, sometimes for regions within countries, by age, sex [ will Belen cover this?]
Produced from a complex (opaque) statistical model, which aims to ensure
consistency between different sources of data on the same quantity
comparability between different (e.g.) countries: differences due to real differences in populations, not due to biases / noise in data
The Global Burden of Disease study has also published methodology and software for estimating disease burden
DisMod II software and paper Barendregt et al. (2003) A generic model for the assessment of disease epidemiology: the computational basis of DisMod II, Pop. Health Metrics
DisMod-MR software and related textbook Flaxman (2015) “An integrative metaregression framework for descriptive epidemiology”
Our group (MRC-BSU / MRC Epi) developed an R package and methodology inspired by these, but open-source, comprehensively documented and accessible: disbayes
Jackson, Zapata-Diomedi & Woodcock (2023) “Bayesian multistate modelling of incomplete chronic disease burden data” (J. R. Statist. Soc. A)
disbayes model and software for estimating disease transition rates
Statistical principles behind the model
Different assumptions that can be made
Using the R package
Practical exercises
Aim to estimate the parameters, age-specific rates \(i(a)\), \(r(a)\), \(f(a)\)
…given observed data, which are assumed to be generated randomly from a statistical model with these parameters
Parameters \(\theta\) considered to have probability distributions
Start with a prior distribution \(p(\theta)\), update with evidence from the data likelihood \(f(y | \theta)\), producing the posterior \(p(\theta | y)\)
Shows uncertainty, and how it is reduced with information
Example: posterior for a proportion (e.g. disease prevalence).
Start with a \(Beta(a=1,b=1)\) prior
Observe \(r\) cases out of \(n\) people (“binomial” likelihood)
Get a \(Beta(a+r,b+n-r)\) posterior distribution
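The conjugate update above can be sketched with only the analytic Beta moments (no external libraries assumed; the interval uses a rough mean ± 2 SD approximation):

```python
import math

def beta_posterior(a, b, r, n):
    """Update a Beta(a, b) prior with r cases out of n people
    (binomial likelihood); return posterior (a', b'), mean and SD."""
    a2, b2 = a + r, b + n - r
    mean = a2 / (a2 + b2)
    sd = math.sqrt(a2 * b2 / ((a2 + b2) ** 2 * (a2 + b2 + 1)))
    return a2, b2, mean, sd

# Beta(1,1) prior, then observe 100 cases out of 1000 people
a2, b2, mean, sd = beta_posterior(1, 1, 100, 1000)
# Posterior is Beta(101, 901), mean 101/1002
lo, hi = mean - 2 * sd, mean + 2 * sd  # rough 95% credible interval
```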
We have at least prevalence, cause-specific population mortality, possibly also incidence and/or remission, all by age.
For statistical inference, must acknowledge the data are summaries of a finite population, so come with some uncertainty
For this method, disease data must come in either of two forms:
estimate and credible interval: “prevalence of dementia in men aged 80 is 10%, with 95% credible interval of (8%–12%)”
event count and denominator: “out of 1000 men aged 80 in the area, 100 have dementia”
We now show these two forms are roughly equivalent - so can convert between them
Posterior distribution, given count of \(r\) events out of denominator \(n\), and a weak \(Beta(0,0)\) prior, is \(Beta(r, n-r)\)
Posterior mean (best estimate): \(r/n\)
Posterior SD: a known function of \(r,n\): \(\sqrt{\hat{p}(1-\hat{p})/(n+1)}\), where \(\hat{p}=r/n\)
95% credible interval has width \(\approx 4 \times\) SD
Given estimate and a CI, can deduce the count \(r\) and denominator \(n\) (software can do this)
(could also use a \(Beta(1,1)\) prior, giving a \(Beta(r+1,n-r+1)\) posterior)
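The deduction of \(r\) and \(n\) from an estimate and CI can be sketched by matching the Beta moments; `ci_to_count` below is a hypothetical helper for illustration, not the actual software routine:

```python
def ci_to_count(est, lower, upper):
    """Convert an estimate with a 95% credible interval to an
    approximate event count r and denominator n, matching the moments
    of a Beta(r, n - r) posterior:
      mean = r / n,  var = mean * (1 - mean) / (n + 1),
    and using CI width ~= 4 * SD."""
    sd = (upper - lower) / 4
    n = est * (1 - est) / sd ** 2 - 1
    r = est * n
    return round(r), round(n)

# e.g. "prevalence 20%, 95% CrI 16%-24%" -> roughly 80 cases out of 400
r, n = ci_to_count(0.20, 0.16, 0.24)
```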
Health impact models typically have a time unit of 1 year of age
Disease data often come in age groups, rather than by year of age
Suppose we have an estimate of 0.1 (95% CI 0.08 to 0.12) for a disease prevalence in age group 55-59
Convert to a count: 100 cases out of 1000.
Easiest to assume equal spread
| Age | Cases | Sample size |
|---|---|---|
| 55 | 20 | 200 |
| 56 | 20 | 200 |
| 57 | 20 | 200 |
| 58 | 20 | 200 |
| 59 | 20 | 200 |
Acknowledges that year-specific data come from smaller sample sizes
More advanced: make counts smoothly varying by age. Can use R tempdisagg package. See the disbayes vignette
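The equal-spread step in the table above amounts to dividing the count and denominator by the number of years; a minimal sketch (the smoother alternative uses the tempdisagg R package, per the disbayes vignette):

```python
def split_age_group(cases, denom, age_from, age_to):
    """Spread an age-group count evenly over single years of age,
    e.g. 100 cases / 1000 people in ages 55-59 -> 20/200 per year.
    Assumes the totals divide evenly; real data may need rounding."""
    years = range(age_from, age_to + 1)
    k = len(years)
    return [(age, cases // k, denom // k) for age in years]

rows = split_age_group(100, 1000, 55, 59)
# [(55, 20, 200), (56, 20, 200), ..., (59, 20, 200)]
```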
Estimates and credible intervals from Global Burden of Disease, converted to counts.
Aim to deduce case fatality rates from these

(available in the disbayes package in the ihdengland dataset)
TODO
10-year survival probabilities published for cancers
Convert to annual probabilities with some assumption on uncertainty
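One such conversion, sketched under the assumption that case fatality is constant over the 10 years (uncertainty handling omitted; numbers illustrative):

```python
import math

def annual_from_10yr_survival(s10):
    """Convert a 10-year survival probability to an annual survival
    probability, and the implied constant case fatality rate,
    assuming the rate is constant over the 10 years."""
    s1 = s10 ** (1 / 10)  # annual survival probability
    f = -math.log(s1)     # case fatality rate (deaths per person-year)
    return s1, f

# e.g. 50% ten-year survival -> ~93.3% annual survival, rate ~0.069
s1, f = annual_from_10yr_survival(0.5)
```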
Example
Each count data-point \(r^{(...)}\) with denominator \(n^{(...)}\) comes from a Binomial distribution with some probability \(p^{(...)}\):
| Measure | Number of people of age \(a\)… | Probability \(p^{(...)}\) |
|---|---|---|
| Prevalence | with the disease | \(p^{(prev)}\) |
| Incidence | getting the disease before \(a+1\) | \(p^{(inc)}\) |
| Cause-specific mortality | dying from the disease before \(a+1\) | \(p^{(mort)}\) |
\(p^{(prev)},p^{(inc)},p^{(mort)}\) are all deterministic functions of the transition probability matrix \(P_j\) for ages \(j=0,...,a\)
\(P_j\): matrix whose \((r,s)\) entry is the probability that a person is in state \(s\) at age \(j+1\), given they are in state \(r\) at age \(j\).
Deterministic function of the rates \(i(a), f(a),...\)
So given all count data \(r^{(...)},n^{(...)}\), we can estimate the rates
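A crude discrete-time sketch of how \(p^{(prev)}\) follows from the rates. This one-year approximation ignores other-cause mortality and remission and allows one transition per year; the real model uses exact solutions, so this is purely illustrative:

```python
import math

def transition_matrix(i, f):
    """One-year transition probabilities between states
    0 = no disease, 1 = disease, 2 = dead from disease,
    as a crude one-transition-per-year approximation of the rates."""
    return [
        [math.exp(-i), 1 - math.exp(-i), 0.0],
        [0.0, math.exp(-f), 1 - math.exp(-f)],
        [0.0, 0.0, 1.0],
    ]

def prevalence_at_age(i_by_age, f_by_age, age):
    """State occupancy at `age`, starting disease-free at age 0;
    returns prevalence among the living."""
    occ = [1.0, 0.0, 0.0]
    for j in range(age):
        P = transition_matrix(i_by_age[j], f_by_age[j])
        occ = [sum(occ[r] * P[r][s] for r in range(3)) for s in range(3)]
    return occ[1] / (occ[0] + occ[1])

# e.g. constant incidence 0.01/yr and case fatality 0.05/yr up to age 60
prev60 = prevalence_at_age([0.01] * 60, [0.05] * 60, 60)
```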
We assume “our data \(X\) were generated from a model with parameters \(\theta\)”
Statistical inference works backwards: then given \(X\) we infer what parameters \(\theta\) are most likely to have generated that data
Bayesian inference: use a procedure called Markov Chain Monte Carlo to sample a sequence of parameters from the posterior distribution.
Approximate Bayesian inference: use optimisation to find the peak of the posterior, then approximate its shape. At least 10× quicker.
disbayes R package uses the Stan software as an engine to do this
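To make “sampling from the posterior” concrete, here is a toy random-walk Metropolis sampler for the binomial-proportion example above (Stan’s samplers are far more sophisticated; this is only a sketch):

```python
import math
import random

def log_posterior(p, r, n):
    """Log posterior for proportion p: flat prior, binomial
    likelihood (constant terms dropped)."""
    if not 0 < p < 1:
        return -math.inf
    return r * math.log(p) + (n - r) * math.log(1 - p)

def metropolis(r, n, steps=20_000, step_size=0.02, seed=1):
    """Random-walk Metropolis: propose p' = p + noise, accept with
    probability min(1, posterior ratio); discard first half as burn-in."""
    rng = random.Random(seed)
    p, lp = 0.5, log_posterior(0.5, r, n)
    samples = []
    for _ in range(steps):
        prop = p + rng.gauss(0, step_size)
        lp_prop = log_posterior(prop, r, n)
        if rng.random() < math.exp(min(0.0, lp_prop - lp)):
            p, lp = prop, lp_prop
        samples.append(p)
    return samples[steps // 2:]

draws = metropolis(r=100, n=1000)
post_mean = sum(draws) / len(draws)
# should be close to the analytic posterior mean 101/1002
```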
TODO Show picture of a distribution — not just a best estimate but uncertainty
Estimate the unknown case fatality rates \(f(a)\) for each age \(a\) from this model.
Assume \(f(a)\) are independent for each age, with vague prior distributions.
Problem: very few people get IHD below age 50, so there is no information on case fatality at younger ages
Can do better by adding plausible assumptions…
Assume rates are smooth functions of age
This is implemented in disbayes as a spline function
Allows data from one age to give information about nearby ages
More efficient use of data \(\rightarrow\) more precise estimates
Additionally assume rates are equal for all ages below 40
Minimum required data: mortality, plus one of prevalence or incidence as counts
Could also have both prevalence and incidence, and remission if assumed
all supplied as counts
The model will estimate posterior distributions for incidence rates and case fatality rates (and remission if assumed) which best fit the data supplied
However… in practice, estimates will not be precise if data are weak
Constraints should be used wherever plausible…
Example: stomach cancer, incidence \(0-5\) cases (out of 5000) per year of age, compared to 10-50 for IHD.
Attempting to fit the model with smooth functions of age, assuming rates equal below age 40, fails (with an incomprehensible error, caused by failure to find a “best estimate”)
Resolved by imposing tighter constraints, e.g. assuming case fatality rates are…
equal for all ages below 70, instead of below 40
or constant across all ages,
or increasing with age?
Or could aggregate data to a larger area, e.g. assuming rates for one city in England are the same as for the whole of England.
The model estimates incidence + case fatality rates that best fit all available count data (prevalence, incidence)
Including current prevalence and current incidence in the same model assumes they are consistent
But current prevalence and mortality are consequences of past incidence: has incidence changed over time?
Right: estimates of incidence are different if we base these on
current prevalence + incidence (i.e. averaging past and current incidence)
current prevalence only (past incidence)
current incidence data only
If there is evidence of conflict between data sources — use the most relevant / trustworthy source
e.g. use observed incidence in your lifetable model, rather than modelled estimates
Advanced: disbayes model can adjust for time trends in incidence and case fatality rates, though challenging in practice
difficult to determine trends precisely from literature
model adjustment is computationally intensive
Fit a separate disbayes model for men and women, and for each different area of interest (regions of a country)
Advanced alternative: disbayes can fit a hierarchical model
Estimates from areas with smaller sample sizes (noisier data) are smoothed towards estimates from larger areas
Computationally intensive
Separate models for female/male, or one model with a “gender effect” independent of area
TODO
Or better to put this earlier. Coordinate with practical sessions
Input
Data and names of variables
Model assumptions
Approximate or full uncertainty
Output
Posterior – tidy estimates
Examples of plots
[tuning is advanced for this]
TODO
Proportional multistate, cause of death assumptions, modelling other-cause
Markov assumption. Model acute events e.g. (MI and stroke) separately.
Microsimulation is more flexible
TODO coordinate with Belen. Intersperse between lectures
Something to firm up idea of credible intervals versus counts
Run disbayes with a different dataset
Different combinations of data — explain differences
Model checking and awareness of biases
Construct plots and summaries